MUSCA: An Algorithm for Constrained Alignment of Multiple Data Sequences.

نویسندگان

  • Parida
  • Floratos
  • Rigoutsos
چکیده

Given a set of N sequences, the Multiple Sequence Alignment problem is to align these N sequences, possibly with gaps, that brings out the best commonality of the N sequences. MUSCA is a two-stage approach to the alignment problem by identifying two relatively simpler sub-problems whose solutions are used to obtain the alignment of the sequences. We first discover motifs in the N sequences and then extract an appropriate subset of compatible motifs to obtain a good alignment. The motifs of interest to us are the irredundant motifs which are only polynomial in the input size. In practice, however, the number is much smaller (sub-linear). Notice that this step aids in a direct N-wise alignment, as opposed to composing the alignments from lower order (say pairwise) alignments and the solution is also independent of the order of the input sequences; hence the algorithm works very well while dealing with a large number of sequences. The second part of the problem that deals with obtaining a good alignment is solved using a graph-theoretic approach that computes an induced subgraph satisfying certain simple constraints. We reduce a version of this problem to that of solving an instance of a set covering problem, thus offer the best possible approximate solution to the problem (provided P not equalNP). Our experimental results, while being preliminary, indicate that this approach is efficient, particularly on large numbers of long sequences, and, gives good alignments when tested on biological data such as DNA and protein sequences. We introduce the the notion of an alignment number K (2 </= K </= N), a user-controlled parameter, that lends a useful flexibility to the aligning program: this additional requirement constrains the alignment to have at least K sequences agree on a character, whenever possible, in the alignment. The usefulness of the alignment number is corroborated by the users who view this as a natural constraint while dealing with a large number of sequences.

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

An Application of the ABS LX Algorithm to Multiple Sequence Alignment

We present an application of ABS algorithms for multiple sequence alignment (MSA). The Markov decision process (MDP) based model leads to a linear programming problem (LPP), whose solution is linked to a suggested alignment. The important features of our work include the facility of alignment of multiple sequences simultaneously and no limit for the length of the sequences. Our goal here is to ...

متن کامل

gpALIGNER: A Fast Algorithm for Global Pairwise Alignment of DNA Sequences

Bioinformatics, through the sequencing of the full genomes for many species, is increasingly relying on efficient global alignment tools exhibiting both high sensitivity and specificity. Many computational algorithms have been applied for solving the sequence alignment problem. Dynamic programming, statistical methods, approximation and heuristic algorithms are the most common methods appli...

متن کامل

An Approximation Algorithm for Alignment of Multiple Sequences using Motif Discovery

Given a set of N sequences, the Multiple Sequence Alignment problem is to align these N sequences, possibly with gaps, that brings out the best commonality of the N sequences. The quality of the alignment is usually measured by penalizing the mis-matches and gaps, and rewarding the matches with appropriate weight functions. However for larger values of N , additional constraints are required to...

متن کامل

A space-efficient algorithm for the constrained pairwise sequence alignment problem.

The constrained pairwise sequence alignment (CPSA) problem aims to align two given sequences by aligning their similar subsequences in the same region under the guidance of a given pattern (constraint). Let the lengths of the sequences be m, and n where n <or= m, and let r <or= n be the length of the given pattern. The optimum constrained pairwise alignment score can be computed using O(rn) spa...

متن کامل

A Fast Algorithm for the Constrained Multiple Sequence Alignment Problem

Given n strings S1, S2, ..., Sn, and a pattern string P , the constrained multiple sequence alignment (CMSA) problem is to find an optimal multiple alignment of S1, S2, . . . , Sn such that the alignment contains P , i.e. in the alignment matrix there exists a sequence of columns each entirely composed of symbol P [k] for every k, where P [k] is the kth symbol in P , 1 ≤ k ≤ |P |, and in the se...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:
  • Genome informatics. Workshop on Genome Informatics

دوره 9  شماره 

صفحات  -

تاریخ انتشار 1998